Loan Default Prediction¶

Problem Definition¶

The Context:¶

  • Why is this problem important to solve?

    Classifying which customers should receive a loan is important so the bank does not lend money to defaulters (customers who do not pay back), since defaulters are one of the main causes of revenue loss for banks, while still keeping the customers who are likely to pay back. Using machine learning, this problem can be solved with less human error and less bias.

The objective:¶

  • What is the intended goal?

    The goal is to build a model that can predict whether a customer will default. The model will also support certain recommendations to the bank, which are given in the conclusion and recommendations section.

The key questions:¶

  • What are the key questions that need to be answered?

What are the main factors that make a customer likely to default?

The problem formulation:¶

  • What is it that we are trying to solve using data science?

We are trying to improve the decision making behind granting a bank loan. Made with data science, this decision will be faster and less biased than the same decision made by a human.

Data Description:¶

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). Twelve input variables were recorded for each applicant.

  • BAD: 1 = Client defaulted on loan, 0 = loan repaid

  • LOAN: Amount of loan approved.

  • MORTDUE: Amount due on the existing mortgage.

  • VALUE: Current value of the property.

  • REASON: Reason for the loan request. (HomeImp = home improvement, DebtCon= debt consolidation which means taking out a new loan to pay off other liabilities and consumer debts)

  • JOB: The type of job the loan applicant has, such as manager, self-employed, etc.

  • YOJ: Years at present job.

  • DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).

  • DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).

  • CLAGE: Age of the oldest credit line in months.

  • NINQ: Number of recent credit inquiries.

  • CLNO: Number of existing credit lines.

  • DEBTINC: Debt-to-income ratio (all your monthly debt payments divided by your gross monthly income. This number is one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow.)
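The debt-to-income definition above can be sketched as a quick calculation; the helper name and dollar amounts here are hypothetical, used only to illustrate how a DEBTINC value like the ones in this dataset arises:

```python
# Hypothetical illustration of the debt-to-income (DEBTINC) calculation:
# total monthly debt payments divided by gross monthly income, as a percentage.

def debt_to_income(monthly_debt_payments: float, gross_monthly_income: float) -> float:
    """Return the debt-to-income ratio as a percentage."""
    return monthly_debt_payments / gross_monthly_income * 100

# e.g. $1,700 in monthly debt payments on $5,000 gross monthly income
ratio = debt_to_income(1700, 5000)
print(round(ratio, 1))  # 34.0
```

A value around 34 is close to the DEBTINC mean seen in the summary statistics below.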

Import the necessary libraries and Data¶

In [1]:
import warnings
warnings.filterwarnings("ignore")

# Libraries for data manipulation and visualization
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

# Algorithms to use
from sklearn.tree import DecisionTreeClassifier

from sklearn import tree

from sklearn.ensemble import RandomForestClassifier

# Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, f1_score, recall_score

from sklearn import metrics

# For hyperparameter tuning
from sklearn.model_selection import GridSearchCV
In [2]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values
In [3]:
# Save the name of the file path to make it easier for modifications
file_path = '/content/drive/MyDrive/Colab Notebooks/Capstone Project/hmeq.csv'
In [4]:
# Read the file
Loan_Default = pd.read_csv(file_path)
In [5]:
# Make a copy of the data frame for modifications
data = Loan_Default.copy()
In [6]:
# Looking at the first 5 elements of the data
data.head()
Out[6]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1 1100 25860.0 39025.0 HomeImp Other 10.5 0.0 0.0 94.366667 1.0 9.0 NaN
1 1 1300 70053.0 68400.0 HomeImp Other 7.0 0.0 2.0 121.833333 0.0 14.0 NaN
2 1 1500 13500.0 16700.0 HomeImp Other 4.0 0.0 0.0 149.466667 1.0 10.0 NaN
3 1 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 1700 97800.0 112000.0 HomeImp Office 3.0 0.0 0.0 93.333333 0.0 14.0 NaN
In [7]:
# Looking at the last 5 elements of the data
data.tail()
Out[7]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
5955 0 88900 57264.0 90185.0 DebtCon Other 16.0 0.0 0.0 221.808718 0.0 16.0 36.112347
5956 0 89000 54576.0 92937.0 DebtCon Other 16.0 0.0 0.0 208.692070 0.0 15.0 35.859971
5957 0 89200 54045.0 92924.0 DebtCon Other 15.0 0.0 0.0 212.279697 0.0 15.0 35.556590
5958 0 89800 50370.0 91861.0 DebtCon Other 14.0 0.0 0.0 213.892709 0.0 16.0 34.340882
5959 0 89900 48811.0 88934.0 DebtCon Other 15.0 0.0 0.0 219.601002 0.0 16.0 34.571519

Summary Statistics¶

In [8]:
# Looking at the info of the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB

Observations:

There are 5,960 entries, and not all columns are fully populated. Several columns have missing values; DEBTINC has the fewest non-null entries.

In [9]:
data.describe().T
Out[9]:
count mean std min 25% 50% 75% max
BAD 5960.0 0.199497 0.399656 0.000000 0.000000 0.000000 0.000000 1.000000
LOAN 5960.0 18607.969799 11207.480417 1100.000000 11100.000000 16300.000000 23300.000000 89900.000000
MORTDUE 5442.0 73760.817200 44457.609458 2063.000000 46276.000000 65019.000000 91488.000000 399550.000000
VALUE 5848.0 101776.048741 57385.775334 8000.000000 66075.500000 89235.500000 119824.250000 855909.000000
YOJ 5445.0 8.922268 7.573982 0.000000 3.000000 7.000000 13.000000 41.000000
DEROG 5252.0 0.254570 0.846047 0.000000 0.000000 0.000000 0.000000 10.000000
DELINQ 5380.0 0.449442 1.127266 0.000000 0.000000 0.000000 0.000000 15.000000
CLAGE 5652.0 179.766275 85.810092 0.000000 115.116702 173.466667 231.562278 1168.233561
NINQ 5450.0 1.186055 1.728675 0.000000 0.000000 1.000000 2.000000 17.000000
CLNO 5738.0 21.296096 10.138933 0.000000 15.000000 20.000000 26.000000 71.000000
DEBTINC 4693.0 33.779915 8.601746 0.524499 29.140031 34.818262 39.003141 203.312149

Observations:

The mean of BAD is about 0.20, meaning roughly 20% of the customers are defaulters.

The loan amount has an average of $18,608.

The mortgage due amount has a mean of $73,760, but some values are missing.

The mean property value is $101,776.

The average time at the present job is around 8.9 years.

The mean number of major derogatory reports is 0.25; given the seriousness of these reports, this is still notable.

The number of delinquent credit lines has a mean of 0.45, which is high considering it means those users are not making the minimum required payments.

The mean number of existing credit lines is around 21.

The debt-to-income ratio has a mean of 33.8, meaning monthly debt payments typically amount to about a third of gross monthly income.

In [10]:
# Making a list for the variables that are numerical
num_cols = data.select_dtypes('number').columns # Selecting numerical columns
num_variable_list = num_cols.tolist()
num_variable_list
Out[10]:
['BAD',
 'LOAN',
 'MORTDUE',
 'VALUE',
 'YOJ',
 'DEROG',
 'DELINQ',
 'CLAGE',
 'NINQ',
 'CLNO',
 'DEBTINC']
In [11]:
# Making a list for the variables that are object/strings
cat_cols = data.select_dtypes('object').columns # Selecting categorical columns
cat_variable_list = cat_cols.tolist()
cat_variable_list
Out[11]:
['REASON', 'JOB']
In [12]:
# Getting the percentages of each categorical feature
for column in cat_cols:
    print("-----",column,"-----")
    print(data[column].value_counts(normalize = True))
    print("-" * 50)
----- REASON -----
DebtCon    0.688157
HomeImp    0.311843
Name: REASON, dtype: float64
--------------------------------------------------
----- JOB -----
Other      0.420349
ProfExe    0.224608
Office     0.166872
Mgr        0.135011
Self       0.033973
Sales      0.019187
Name: JOB, dtype: float64
--------------------------------------------------

Observations:

The loan reason is mainly 'DebtCon' (debt consolidation). The job category is mostly 'Other', with 'ProfExe' second highest; the bank might offer more job options so fewer users choose 'Other'.

In [13]:
# Count the NaN values in each column
null_count = data.isna().sum()
print(null_count.sort_values())
BAD           0
LOAN          0
VALUE       112
CLNO        222
REASON      252
JOB         279
CLAGE       308
NINQ        510
YOJ         515
MORTDUE     518
DELINQ      580
DEROG       708
DEBTINC    1267
dtype: int64
In [14]:
# Null counts as a fraction of the total number of rows
new_null_count = null_count / len(data)
print(new_null_count.sort_values())
BAD        0.000000
LOAN       0.000000
VALUE      0.018792
CLNO       0.037248
REASON     0.042282
JOB        0.046812
CLAGE      0.051678
NINQ       0.085570
YOJ        0.086409
MORTDUE    0.086913
DELINQ     0.097315
DEROG      0.118792
DEBTINC    0.212584
dtype: float64

Observations:

Due to the high number of missing values in some of the columns, it is more valuable to treat the missing values before doing the EDA.

Missing Value Treatment¶

In [15]:
# First we convert the BAD column to categorical.

cols_cat = data.select_dtypes(['object']).columns.tolist()
cols_cat.append('BAD') # We add BAD

# Convert BAD to object dtype so it is treated as a categorical variable
data['BAD'] = data['BAD'].astype(object)

print(data.info()) # We confirm that BAD is an object dtype.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   object 
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(1), object(3)
memory usage: 605.4+ KB
None
In [16]:
# Treat missing values: median for numerical columns, mode for categorical columns

# Select numeric columns
num_data = data.select_dtypes('number')

# Select string and object columns
cat_data = data.select_dtypes('object').columns.tolist()

# Fill numeric features with median

data[num_data.columns] = num_data.fillna(num_data.median())

for column in cat_data:
    mode = data[column].mode()[0]
    data[column] = data[column].fillna(mode)
    #print(column,mode) #to check the iterations
In [17]:
data.head()
Out[17]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1 1100 25860.0 39025.0 HomeImp Other 10.5 0.0 0.0 94.366667 1.0 9.0 34.818262
1 1 1300 70053.0 68400.0 HomeImp Other 7.0 0.0 2.0 121.833333 0.0 14.0 34.818262
2 1 1500 13500.0 16700.0 HomeImp Other 4.0 0.0 0.0 149.466667 1.0 10.0 34.818262
3 1 1500 65019.0 89235.5 DebtCon Other 7.0 0.0 0.0 173.466667 1.0 20.0 34.818262
4 0 1700 97800.0 112000.0 HomeImp Office 3.0 0.0 0.0 93.333333 0.0 14.0 34.818262

Exploratory Data Analysis (EDA) and Visualization¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. What is the range of values for the loan amount variable "LOAN"?
  2. How does the distribution of years at present job "YOJ" vary across the dataset?
  3. How many unique categories are there in the REASON variable?
  4. What is the most common category in the JOB variable?
  5. Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
  6. Do applicants who default have a significantly different loan amount compared to those who repay their loan?
  7. Is there a correlation between the value of the property and the loan default rate?
  8. Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?

Univariate Analysis¶

In [18]:
# Plot distribution and boxplots for numerical variables
for col in num_data:
    print(col)

    print('Skew :', round(data[col].skew(), 2))

    plt.figure(figsize = (15, 4))

    plt.subplot(1,2,1)

    data[col].hist(bins = 10, grid = False)

    plt.ylabel('count')

    plt.subplot(1, 2, 2)

    sns.boxplot(x = data[col])

    plt.show()
LOAN
Skew : 2.02
MORTDUE
Skew : 1.94
VALUE
Skew : 3.09
YOJ
Skew : 1.09
DEROG
Skew : 5.69
DELINQ
Skew : 4.25
CLAGE
Skew : 1.39
NINQ
Skew : 2.77
CLNO
Skew : 0.8
DEBTINC
Skew : 3.11

Observations:

All of the numerical variables are right-skewed. DEROG and DELINQ are the most heavily skewed (5.69 and 4.25), since most applicants have zero derogatory reports or delinquent credit lines. LOAN, VALUE, and DEBTINC also show long right tails, with outliers visible in the boxplots.

In [19]:
# Bar plot of the DELINQ and its count
plt.figure(figsize = (10, 6))

ax = sns.countplot(x = 'DELINQ', data = data)
# Get the number count for each column
for p in ax.patches:
    ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x(), p.get_height()))
plt.show()
In [20]:
# Bar plot of each categorical variable, with the status distinction.
for col in cat_data:
    print(col)

    plt.figure(figsize = (10, 6))

    ax = sns.countplot(x = col, data = data)

    # Get the number count for each column
    for p in ax.patches:
        ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x(), p.get_height()))

    plt.show()
BAD
REASON
JOB

Bivariate Analysis¶

In [21]:
# Getting boxplots for each numerical feature by status
for col in num_data:
    print(col)
    plt.figure(figsize = (10, 5))

    sns.boxplot(y=data[col], x=data["BAD"])

    plt.show()
LOAN
MORTDUE
VALUE
YOJ
DEROG
DELINQ
CLAGE
NINQ
CLNO
DEBTINC
In [22]:
# Bar plot of each categorical variable, with the status distinction.
for col in cat_data:
    print(col)

    plt.figure(figsize = (10, 6))

    ax = sns.countplot(x = col, hue = 'BAD', data = data)

    # Get the number count for each column
    for p in ax.patches:
        ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x(), p.get_height()))

    plt.show()
BAD
REASON
JOB

def stacked_plot(x):
    sns.set(palette='nipy_spectral')
    tab1 = pd.crosstab(x, data['BAD'], margins=True)
    print(tab1)
    print('-' * 120)
    tab = pd.crosstab(x, data['BAD'], normalize='index')
    tab.plot(kind='bar', stacked=True, figsize=(10, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

# Plot stacked bar plot for BAD and REASON
stacked_plot(data['REASON'])

Multivariate Analysis¶

In [23]:
# Correlations between all the variables using a heatmap
plt.figure(figsize = (12, 7))

sns.heatmap(data.corr(), annot = True)

plt.show()
In [24]:
# create a pair plot by status to find any relationships
#sns.pairplot(data, hue ='BAD')

Treating Outliers¶

In [25]:
# Make a list of the features that will have the outlier treatment
outlier_list = [
 'LOAN',
 'MORTDUE',
 'VALUE',
 'YOJ',
 'CLAGE',
 'NINQ',
 'CLNO',
 'DEBTINC']
In [26]:
# Using the IQR to treat outliers
def treat_outliers(df,col):

    Q1=df[col].quantile(0.25) # 25th quantile
    Q3=df[col].quantile(0.75)  # 75th quantile
    IQR=Q3-Q1   # IQR Range
    Lower_Whisker = Q1 - IQR # lower bound (Q1 - 1*IQR)
    Upper_Whisker = Q3 + IQR # upper bound (Q3 + 1*IQR)
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker) #  used to limit the values with the upper and lower whiskers

    return df

# Using the past function to do it in all the numerical variables
def treat_outliers_all(df, col_list):

    for c in col_list:
        df = treat_outliers(df,c)

    return df


data_no_outlier = data.copy()

numerical_col = outlier_list # numerical columns selected for outlier treatment (excludes BAD, DEROG, and DELINQ)

df = treat_outliers_all(data_no_outlier,numerical_col)
In [27]:
# Helper function borrowed from earlier case studies that creates a boxplot and histogram for any numerical variable.
# It takes the numerical column as input and plots the boxplot and histogram for that variable.

def histogram_boxplot(feature, figsize=(15,10), bins = None):

    f2, (ax_box2, ax_hist2) = plt.subplots(nrows = 2, # Number of rows of the subplot grid= 2
                                           sharex = True, # x-axis will be shared among all subplots
                                           gridspec_kw = {"height_ratios": (.25, .75)},
                                           figsize = figsize
                                           ) # creating the 2 subplots
    sns.boxplot(feature, ax=ax_box2, showmeans=True, color='pink', orient="h") # boxplot will be created and a star will indicate the mean value of the column
    sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins) if bins else sns.distplot(feature, kde=False, ax=ax_hist2) # For histogram
    ax_hist2.axvline(np.mean(feature), color='red', linestyle='--') # Add mean to the histogram
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-') # Add median to the histogram
In [28]:
# Plot the histogram and boxplot with outliers
for col in num_data:
  histogram_boxplot(data[col])
In [29]:
# Plot the histogram and boxplot without outliers
for col in num_data:
  histogram_boxplot(df[col])

Observations:

The outlier treatment worked, as seen in the previous plots, and the data is ready for modeling.

In [30]:
# Correlations between all the variables using a heatmap
plt.figure(figsize = (12, 7))

sns.heatmap(data_no_outlier.corr(), annot = True)

plt.show()

Observations:

The dataset has one strong positive correlation: between VALUE and MORTDUE. There are other, weaker positive correlations, such as LOAN and VALUE, LOAN and MORTDUE, CLNO with MORTDUE and VALUE, DEROG and BAD, and DELINQ and BAD.

In [31]:
# create a pair plot by status to find any relationships
sns.pairplot(data, hue ='BAD')
Out[31]:
<seaborn.axisgrid.PairGrid at 0x7f294a1a4c70>

Important Insights from EDA¶

What are the most important observations and insights from the data based on the EDA performed?

Leading Questions:

  1. What is the range of values for the loan amount variable "LOAN"?

The loans range from $1,100 to $89,900.

  2. How does the distribution of years at present job "YOJ" vary across the dataset?

The distribution is right-skewed: most customers have fewer than 7 years at their present job.

  3. How many unique categories are there in the REASON variable?

Two: DebtCon (debt consolidation) and HomeImp (home improvement).

  4. What is the most common category in the JOB variable?

The most common category is Other.

  5. Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?

More defaults come from DebtCon in absolute numbers, but proportionally there is not much difference between the two reasons.

  6. Do applicants who default have a significantly different loan amount compared to those who repay their loan?

Applicants with lower loan amounts tend to default more often, while applicants with larger loans default less.

  7. Is there a correlation between the value of the property and the loan default rate?

Only a weak one: the correlation between VALUE and BAD is slight, so property value on its own says little about default.

  8. Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?

No, there is no significant difference.

Model Building - Approach¶

  • Data preparation
  • Partition the data into train and test set
  • Build the model
  • Fit on the train data
  • Tune the model
  • Test the model on test set
In [32]:
# Data Preparation
#clean data set for logistic regression
df_LR = df.copy()

# A new copy for the modeling
df_ready = df.copy()

# Drop the dependent variable from the dataframe and create the X(independent variable) matrix

X = df_ready.drop(columns = 'BAD')

# Create dummy variables for the categorical variables - Hint: use the get_dummies() function

X = pd.get_dummies(X, drop_first = True)

# Create y (dependent variable)

Y = df_ready['BAD']
In [33]:
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1)

Logistic Regression¶

In [34]:
#Split the data into training and testing
# 70/30 split

# All rows where 'BAD' column is 1
input_ones = df_LR[df_LR['BAD'] == 1]
# All rows where 'BAD' column is 0
input_zeros = df_LR[df_LR['BAD'] == 0]

# For repeatability of sample
np.random.seed(100)

input_ones_training_rows = np.random.choice(input_ones.index, int(0.7 * len(input_ones)), replace=False)
input_zeros_training_rows = np.random.choice(input_zeros.index, int(0.7 * len(input_zeros)), replace=False)

# Pick as many 0 and 1
training_ones = input_ones.loc[input_ones_training_rows]
training_zeros = input_zeros.loc[input_zeros_training_rows]

# Concatenate
trainingData = pd.concat([training_ones, training_zeros])

# Create test data
test_ones = input_ones.drop(input_ones_training_rows)
test_zeros = input_zeros.drop(input_zeros_training_rows)

# Concatenate
testData = pd.concat([test_ones, test_zeros])
In [35]:
#check for imbalance
bad_counts = trainingData['BAD'].value_counts()

print(bad_counts)
print('____________________')

# check class distrubution
class_distribution = trainingData['BAD'].value_counts(normalize=True)

print(class_distribution)
0    3339
1     832
Name: BAD, dtype: int64
____________________
0    0.800527
1    0.199473
Name: BAD, dtype: float64
In [36]:
# Visualize class distribution using a bar plot
plt.figure(figsize=(8, 6))
sns.barplot(x=class_distribution.index, y=class_distribution.values)
plt.xlabel('0: Paid, 1: Defaulted')
plt.ylabel('Ratio')
plt.title('Class Distribution')
plt.show()

Observations:

The bar plot confirms the class imbalance seen in the counts: about 80% of the customers paid back (class 0) and about 20% defaulted (class 1).
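The {0: 0.2, 1: 0.8} class weights used for the tree models later in the notebook roughly mirror this distribution. A minimal sketch (helper name hypothetical) of how such weights can be derived inversely to class frequency:

```python
# Sketch: derive class weights inversely proportional to class frequency,
# normalised to sum to 1. With the ~80/20 split in the training data this
# recovers weights close to the {0: 0.2, 1: 0.8} used for the decision tree.

def inverse_frequency_weights(counts: dict) -> dict:
    """Weight each class by 1/frequency, normalised so the weights sum to 1."""
    total = sum(counts.values())
    raw = {cls: total / n for cls, n in counts.items()}
    norm = sum(raw.values())
    return {cls: w / norm for cls, w in raw.items()}

weights = inverse_frequency_weights({0: 3339, 1: 832})  # training-set counts
print({cls: round(w, 2) for cls, w in weights.items()})  # {0: 0.2, 1: 0.8}
```

Weighting this way makes each mistake on the rare class (defaulters) cost the model roughly four times as much as a mistake on the majority class.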

Model¶

In [37]:
# Function to print classification report and get confusion matrix in a proper format

def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))

    cm = confusion_matrix(actual, predicted)

    plt.figure(figsize = (8, 5))

    sns.heatmap(cm, annot = True,  fmt = '.2f', xticklabels = ['Repaid', 'Defaulted'], yticklabels = ['Repaid', 'Defaulted'])

    plt.ylabel('Actual')

    plt.xlabel('Predicted')

    plt.show()
In [38]:
#Create a table to add the results of the model
results = pd.DataFrame(columns = ['Model_Name','Train_f1','Train_recall','Train_precision','Test_f1','Test_recall','Test_precision'])

results.head()
Out[38]:
Model_Name Train_f1 Train_recall Train_precision Test_f1 Test_recall Test_precision

Decision Tree¶

In [39]:
#Defining Decision tree model using the previous class weights for imbalances

d_tree_base = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.2, 1: 0.8})
In [40]:
# Fitting the decision tree classifier on the training data
d_tree =  DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8} ,random_state = 7)

d_tree.fit(X_train, y_train)
Out[40]:
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=7)
In [41]:
# Checking performance on the training data
y_pred_train_d_tree = d_tree.predict(X_train)

metrics_score(y_train, y_pred_train_d_tree)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3355
           1       1.00      1.00      1.00       817

    accuracy                           1.00      4172
   macro avg       1.00      1.00      1.00      4172
weighted avg       1.00      1.00      1.00      4172

In [42]:
# Checking performance on the testing data
y_pred_test_d_tree = d_tree.predict(X_test)

metrics_score(y_test, y_pred_test_d_tree)
              precision    recall  f1-score   support

           0       0.90      0.93      0.91      1416
           1       0.69      0.60      0.64       372

    accuracy                           0.86      1788
   macro avg       0.79      0.76      0.78      1788
weighted avg       0.85      0.86      0.86      1788

In [43]:
# Adding the results to the table

new_row = {'Model_Name': 'd_tree_base',
           'Train_f1': 100,
           'Train_recall': 100,
           'Train_precision':100,
           'Test_f1': 64,
           'Test_recall': 60,
           'Test_precision': 69}

results = pd.concat([results, pd.DataFrame([new_row])], ignore_index=True)

# Print the updated DataFrame
print(results)
    Model_Name Train_f1 Train_recall Train_precision Test_f1 Test_recall  \
0  d_tree_base      100          100             100      64          60   

  Test_precision  
0             69  

Decision Tree - Hyperparameter Tuning¶

  • Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model, so we usually resort to experimentation. We'll use Grid search to perform hyperparameter tuning.
  • Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
  • It is an exhaustive search that is performed on the specific parameter values of a model.
  • The parameters of the estimator/model used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

Criterion {“gini”, “entropy”}

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

max_depth

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_leaf

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

You can learn about more hyperparameters at the link below and try tuning them.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [44]:
# Choose the type of classifier.
d_tree_tuned = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.2, 1: 0.8})

# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 6), #[2, 3, 4, 5 ]
              'criterion': ['gini', 'entropy'], #use both gini and entropy to measure split quality
              'min_samples_leaf': [5, 10, 20, 25] #minimum number of samples to be a leaf node
             }


# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
grid_obj = GridSearchCV(d_tree_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
d_tree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data
d_tree_tuned.fit(X_train, y_train)
Out[44]:
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, criterion='entropy',
                       max_depth=5, min_samples_leaf=10, random_state=7)
In [45]:
# Checking performance on the training data based on the tuned model
y_pred_train_d_tree_tuned = d_tree_tuned.predict(X_train)

metrics_score(y_train,y_pred_train_d_tree_tuned)
              precision    recall  f1-score   support

           0       0.94      0.90      0.92      3355
           1       0.66      0.77      0.71       817

    accuracy                           0.88      4172
   macro avg       0.80      0.84      0.82      4172
weighted avg       0.89      0.88      0.88      4172

In [46]:
# Checking performance on the testing data based on the tuned model
y_pred_test_d_tree_tuned = d_tree_tuned.predict(X_test)

metrics_score(y_test,y_pred_test_d_tree_tuned)
              precision    recall  f1-score   support

           0       0.92      0.91      0.92      1416
           1       0.68      0.72      0.70       372

    accuracy                           0.87      1788
   macro avg       0.80      0.81      0.81      1788
weighted avg       0.87      0.87      0.87      1788

In [47]:
# Adding the results to the table

new_row = {'Model_Name': 'd_tree_base_tuned',
           'Train_f1': 71,
           'Train_recall': 77,
           'Train_precision':66,
           'Test_f1': 70,
           'Test_recall': 72,
           'Test_precision': 68}

results = pd.concat([results, pd.DataFrame([new_row])], ignore_index=True)

# Print the updated DataFrame
print(results)
          Model_Name Train_f1 Train_recall Train_precision Test_f1  \
0        d_tree_base      100          100             100      64   
1  d_tree_base_tuned       71           77              66      70   

  Test_recall Test_precision  
0          60             69  
1          72             68  
In [48]:
# Plot the decision  tree and analyze it to build the decision rule
features = list(X.columns)

plt.figure(figsize = (40,40))

tree.plot_tree(d_tree_tuned, feature_names = features, filled = True, fontsize = 10, node_ids = True, class_names = True)

plt.show()
In [49]:
# Plotting the feature importance
importances = d_tree_tuned.feature_importances_

indices = np.argsort(importances)

plt.figure(figsize = (15, 10))

plt.title('Most Important Features')

plt.barh(range(len(indices)), importances[indices], color = 'blue')

plt.yticks(range(len(indices)), [features[i] for i in indices])

plt.xlabel('Importance Ratio')

plt.show()

Observations:

According to the feature importances, the main features associated with default are DEBTINC, DELINQ, CLAGE, and MORTDUE.

Building a Random Forest Classifier¶

Random Forest is a bagging algorithm whose base models are decision trees. Bootstrap samples are drawn from the training data, and a decision tree trained on each sample makes a prediction.

The results from all the decision trees are combined together and the final prediction is made using voting or averaging.
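The voting step described above can be sketched as follows; the individual tree predictions here are made up for illustration only:

```python
from collections import Counter

# Hypothetical predictions from five decision trees for one applicant
# (1 = default, 0 = repaid).
tree_predictions = [1, 0, 1, 1, 0]

# Majority vote: the most common class across the trees becomes the
# ensemble's final prediction for this applicant.
final_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(final_prediction)  # 1
```

Because the final prediction averages out the quirks of any single tree, the ensemble is less prone to the overfitting seen in the unpruned decision tree above.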

In [50]:
# Defining the Random Forest classifier
rf_estimator = RandomForestClassifier(random_state=7,criterion="entropy")

rf_estimator.fit(X_train,y_train)
Out[50]:
RandomForestClassifier(criterion='entropy', random_state=7)
In [51]:
#Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(X_train)

metrics_score(y_train,y_pred_train_rf)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3355
           1       1.00      1.00      1.00       817

    accuracy                           1.00      4172
   macro avg       1.00      1.00      1.00      4172
weighted avg       1.00      1.00      1.00      4172

In [52]:
# Checking performance on the test data
y_pred_test_rf = rf_estimator.predict(X_test)

metrics_score(y_test, y_pred_test_rf)
              precision    recall  f1-score   support

           0       0.92      0.98      0.95      1416
           1       0.88      0.66      0.76       372

    accuracy                           0.91      1788
   macro avg       0.90      0.82      0.85      1788
weighted avg       0.91      0.91      0.91      1788

In [53]:
# Adding the results to the table

new_row = {'Model_Name': 'Random Forest Classifier',
           'Train_f1': 100,
           'Train_recall': 100,
           'Train_precision':100,
           'Test_f1': 76,
           'Test_recall': 66,
           'Test_precision': 88}

results = pd.concat([results, pd.DataFrame([new_row])], ignore_index=True)

# Print the updated DataFrame
print(results)
                 Model_Name Train_f1 Train_recall Train_precision Test_f1  \
0               d_tree_base      100          100             100      64   
1         d_tree_base_tuned       71           77              66      70   
2  Random Forest Classifier      100          100             100      76   

  Test_recall Test_precision  
0          60             69  
1          72             68  
2          66             88  
Random Forest Classifier Hyperparameter Tuning¶

In [54]:
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)

# Grid of parameters to choose from
parameters = {"n_estimators": [100, 110],
    "max_depth": [5,6],
    "max_leaf_nodes": [8,10],
    "min_samples_split":[20],
    'criterion': ['gini'],
    "max_features": ['sqrt'],
    "class_weight": ["balanced",{0: 0.2, 1: 0.8}]
             }

# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)

grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
# Fitting the best algorithm to the training data
rf_estimator_tuned.fit(X_train, y_train)
Out[54]:
RandomForestClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=8,
                       min_samples_split=20, n_estimators=110, random_state=7)
In [55]:
# Checking performance on the training data
y_pred_train_rf_tuned = rf_estimator_tuned.predict(X_train)

metrics_score(y_train, y_pred_train_rf_tuned)
              precision    recall  f1-score   support

           0       0.95      0.83      0.88      3355
           1       0.53      0.82      0.65       817

    accuracy                           0.82      4172
   macro avg       0.74      0.82      0.77      4172
weighted avg       0.87      0.82      0.84      4172

In [56]:
# Checking performance on the test data
y_pred_test_rf_tuned = rf_estimator_tuned.predict(X_test)

metrics_score(y_test, y_pred_test_rf_tuned)
              precision    recall  f1-score   support

           0       0.94      0.83      0.88      1416
           1       0.55      0.79      0.65       372

    accuracy                           0.82      1788
   macro avg       0.74      0.81      0.76      1788
weighted avg       0.86      0.82      0.83      1788

In [57]:
# Plot the most important features
importances = rf_estimator_tuned.feature_importances_

indices = np.argsort(importances)

feature_names = list(X.columns)

plt.figure(figsize = (5, 5))

plt.title('Feature Importances')

plt.barh(range(len(indices)), importances[indices], color = 'blue')

plt.yticks(range(len(indices)), [feature_names[i] for i in indices])

plt.xlabel('Relative Importance')

plt.show()
In [58]:
# Adding the results to the table

new_row = {'Model_Name': 'Random Forest Classifier Tuned',
           'Train_f1': 65,
           'Train_recall': 82,
           'Train_precision':53,
           'Test_f1': 65,
           'Test_recall': 79,
           'Test_precision': 55}

results = pd.concat([results, pd.DataFrame([new_row])], ignore_index=True)

# Print the updated DataFrame
print(results)
results
                       Model_Name Train_f1 Train_recall Train_precision  \
0                     d_tree_base      100          100             100   
1               d_tree_base_tuned       71           77              66   
2        Random Forest Classifier      100          100             100   
3  Random Forest Classifier Tuned       65           82              53   

  Test_f1 Test_recall Test_precision  
0      64          60             69  
1      70          72             68  
2      76          66             88  
3      65          79             55  
Out[58]:
Model_Name Train_f1 Train_recall Train_precision Test_f1 Test_recall Test_precision
0 d_tree_base 100 100 100 64 60 69
1 d_tree_base_tuned 71 77 66 70 72 68
2 Random Forest Classifier 100 100 100 76 66 88
3 Random Forest Classifier Tuned 65 82 53 65 79 55

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

  • How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

The best-performing technique was the tuned Random Forest, which achieved 79% recall for class 1 on the test set, though at the cost of lower precision (55%). There may still be scope to improve, for example by widening the hyperparameter grid or adjusting the classification threshold.
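One inexpensive way to explore the recall/precision trade-off further, without retraining, is to lower the 0.5 decision threshold applied to `predict_proba`. The sketch below uses hypothetical toy data rather than the notebook's `X_train`/`X_test`, but the mechanism is the same:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, precision_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for HMEQ (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=7, stratify=y_demo)

clf = RandomForestClassifier(random_state=7).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # predicted probability of class 1 (default)

# Lowering the threshold can only raise recall, typically at the cost of precision
recalls = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, pred)
    print(f"threshold={threshold}: recall={recalls[threshold]:.2f}, "
          f"precision={precision_score(y_te, pred):.2f}")
```

Sweeping the threshold over a grid (or using `precision_recall_curve`) would let the bank pick the operating point that matches its risk appetite.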

2. Refined insights:

  • What are the most meaningful insights relevant to the problem?

The most important insight is that DEBTINC, DELINQ, DEROG, and CLAGE are the main features driving default. These features appeared as important in both the decision tree and the random forest.

3. Proposal for the final solution design:

  • What model do you propose to be adopted? Why is this the best solution to adopt?

The best model to adopt is the tuned Random Forest, as it is the model that maximizes recall. Even though its F1-score is not the highest, for this problem it is more beneficial to maximize recall: a false negative (approving a loan for a customer who defaults) costs the bank far more than a false positive (rejecting a customer who would have repaid).
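The recall-first argument can be made concrete with a back-of-the-envelope cost comparison. The dollar amounts below are purely hypothetical assumptions; the recall/precision figures and the 372 test-set positives come from the classification reports above:

```python
# Hypothetical costs: an approved defaulter (false negative) is assumed far
# costlier than a rejected good customer (false positive)
FN_COST, FP_COST = 10_000, 1_000

def expected_cost(recall, precision, n_pos=372):
    """Approximate test-set cost implied by class-1 recall and precision."""
    tp = recall * n_pos
    fn = n_pos - tp
    fp = tp / precision - tp  # from precision = tp / (tp + fp)
    return fn * FN_COST + fp * FP_COST

# Test-set metrics for class 1 from the notebook
cost_untuned = expected_cost(recall=0.66, precision=0.88)
cost_tuned = expected_cost(recall=0.79, precision=0.55)
print(f"untuned RF: ~${cost_untuned:,.0f}, tuned RF: ~${cost_tuned:,.0f}")
```

Under these assumed costs the tuned model comes out cheaper despite its lower precision, which is the intuition behind preferring recall here; with a different cost ratio the conclusion could flip.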
